Abstract. A new measurement of the anomalous magnetic moment of the muon, $a_\mu = (g-2)/2$,
will be performed at the Fermi National Accelerator Laboratory. The most recent
measurement, performed at Brookhaven National Laboratory and completed in 2001, shows a
3.3-3.6 standard deviation discrepancy with the Standard Model predictions for $a_\mu$. The new
measurement will accumulate 21 times those statistics, measuring $a_\mu$ to 140 ppb and reducing
the uncertainty by a factor of 4.

The data acquisition system for this experiment must be able to produce deadtime-free
records of 700 μs muon spills at a raw data rate of 18 GB per second. Data will be collected
using 1296 channels of μTCA-based 800 MHz, 12 bit waveform digitizers and processed in a
layered array of networked commodity processors with 24 GPUs working in parallel to perform
fast recording and processing of detector signals during the spill. The system will be controlled
using the MIDAS data acquisition software package. The described data acquisition system is 
currently being constructed, and will be fully operational before the start of the experiment in 
2017. 


1. Introduction 

In the Dirac theory, the muon is a spin 1/2 pointlike particle. It has a magnetic moment given
by Eq. 1,

$$\vec{\mu} = g \frac{Qe}{2m} \vec{s} \qquad (1)$$

where g was predicted by Dirac to be identically equal to 2. Contributions from QED, weak
and hadronic loops move the Standard Model's predicted value of g very slightly away from 2,
so it has become customary to measure the so-called muon anomaly, $a_\mu = (g-2)/2$. If a discrepancy
with the Standard Model is found, further contributions to $a_\mu$ could come from SUSY, dark
photons, or other new physics. Ongoing theoretical and experimental efforts are improving the
precision of the Standard Model prediction for $a_\mu$. See for example the first lattice calculation
of the light-by-light scattering contribution to the hadronic term, a computation of the tenth
order QED contribution to g-2 ($a_\mu^{\mathrm{QED}} = 116584718.951(80) \times 10^{-11}$), and an analysis to
reevaluate the hadronic contribution to g-2 using data from KLOE and BABAR (giving a 3.6σ
or 2.4σ discrepancy from the experimental result depending on the measurement method).

The muon anomaly was measured most recently by Brookhaven National Lab experiment
E821, which showed a ~3σ discrepancy with the Standard Model. The new muon g-2
experiment at Fermilab, E989, will measure 21 times the number of muon decays, reducing
the uncertainty on this measurement by a factor of four, and plans to begin taking data in early
2017. Without theoretical improvements, the discrepancy with the Standard Model
could reach > 5σ, as shown in Fig. 1. The increase in data rate required to make this
measurement calls for a state-of-the-art data acquisition system.
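
As a rough consistency check on the numbers quoted above, the following sketch assumes that the statistical uncertainty scales as $1/\sqrt{N}$ with the number of recorded decays $N$. This scaling is an assumption added here for illustration and is not taken from the experiment's error budget:

$$\frac{\sigma_{\mathrm{stat}}^{\mathrm{E989}}}{\sigma_{\mathrm{stat}}^{\mathrm{E821}}} \approx \sqrt{\frac{N_{\mathrm{E821}}}{N_{\mathrm{E989}}}} = \frac{1}{\sqrt{21}} \approx 0.22$$

Under that assumption, statistics alone would reduce the uncertainty by a factor of about 4.6, so an overall factor of four is plausible once contributions that do not shrink with $N$ are included.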

The muon anomaly will be obtained by measuring the precession of muons in a magnetic field.
The value of g-2 will be extracted from precise measurements of the anomalous precession
frequency $\omega_a$, which is the difference between the spin precession frequency and the
cyclotron frequency of the muon, and of the magnetic field $\vec{B}$. These are related to $a_\mu$ by Eq. 2,

$$\vec{\omega}_a = -\frac{Qe}{m} a_\mu \vec{B} \qquad (2)$$

The measurement will be performed using a superconducting magnetic ring that was recently 
relocated from Brookhaven National Lab in Upton, NY to Fermilab in Batavia, IL. The ring has 
recently been installed into the new MC-1 building at Fermilab, the steel yokes have all been 
attached, and cooling of the magnet has begun. The current status of the experiment has
recently been reviewed in greater detail in several articles.



Figure 1. Comparison of expected 
uncertainties on new muon g-2 
measurement to those of BNL-E821 
and Standard Model predictions. 


Fills of polarized 3.1 GeV/c muons will be injected into the storage ring at a rate of 12 
Hz. The muons will decay into positrons, which will be detected by a suite of calorimeters and 
tracking detectors in the interior of the ring. Each of the 24 calorimeters is composed of a 9 x 6 
array of PbF2 crystals with silicon photomultiplier readout, giving 1296 total channels. The
number of muons per 700 μs fill is expected to be $1.0 \times 10^4$, but with an estimated acceptance
of 10.7%, we expect to measure $1.1 \times 10^3$ muons per fill (at an energy > 1.86 GeV).

The calorimeter signals will be digitized using custom μTCA advanced mezzanine cards
(AMCs) built at Cornell University. Each AMC is a 5 channel, 12 bit, 800 MHz waveform
digitizer (WFD). Its data are transmitted at the expected instantaneous rate through the μTCA
backplane to an AMC13 control card in the μTCA shelf, where the data are aggregated
and then read out asynchronously over 10GbE fiber by the data acquisition system. The μTCA
shelf is controlled via IPBus commands sent via a Vadatech MCH. The total data rate from the
calorimeters to the DAQ is expected to be 17.4 GB/s, given by 1296 channels × 2 bytes per
sample × 560000 samples per fill × 12 fills per second.
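
The arithmetic behind this estimate can be reproduced with a few lines of Python. This is only an illustrative sketch: the function and parameter names are hypothetical, and it assumes that the 560000 samples per fill come from sampling the 700 μs fill at 800 MHz and that 1 GB = 10^9 bytes.

```python
def samples_per_fill(fill_length_us=700, sampling_rate_mhz=800):
    """Samples per channel in one fill (microseconds x MHz = samples)."""
    return fill_length_us * sampling_rate_mhz


def raw_data_rate_bytes_per_s(channels=1296, bytes_per_sample=2,
                              fills_per_second=12):
    """Raw calorimeter data rate in bytes per second."""
    return channels * bytes_per_sample * samples_per_fill() * fills_per_second


if __name__ == "__main__":
    rate = raw_data_rate_bytes_per_s()
    print(f"{samples_per_fill()} samples per fill")
    print(f"{rate / 1e9:.1f} GB/s raw calorimeter data rate")
```

Each 12 bit sample is counted as 2 bytes, as in the text, which is why the estimate comes out somewhat larger than a bit-packed format would give.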

The system will be calibrated using a laser calibration system that will operate
continuously during data taking, as well as in dedicated high rate
calibration runs. This high trigger rate data must also be accommodated by the data
acquisition system.

2. Data Acquisition System 

The g-2 DAQ is being built using the MIDAS data acquisition software package, which was
developed at PSI and is widely used at PSI, TRIUMF, and other labs. The design of the DAQ is
based largely on the prior experience of the collaboration, in particular with the MuLan DAQ. The
physical system will be composed of a layered array of networked commodity processors with
GPUs for parallelization of the data processing. The system must be able to provide a
deadtime-free record of each 700 μs muon fill at a rate of 12 fills per second with a minimum
fill separation of 11 ms, giving a total data rate of 18 GB/s. It will process data from 1296
calorimeter channels, 3 straw tracker stations, and multiple auxiliary detectors. In addition to
the derived datasets, the system will have the ability to write raw data for a fraction of the
muon fills.

MIDAS provides a convenient web interface for control of the experiment, as shown in Fig. 2,
as well as the framework for an event builder and data logger. The logger will output data in a
MIDAS binary format, which can subsequently be processed into ROOT trees and analyzed in the
art framework developed at Fermilab, described in these proceedings by R. Kutschke.

MIDAS also provides an online database (ODB) used both for saving the configuration of the
experiment from run to run and for control of the detectors: settings that are changed in
the ODB are hot-linked to the frontend processes, which update in real time. Thus it is possible to
change the hardware configuration of a WFD component or set the HV on a detector directly via
the MIDAS ODB. The values from the ODB can also be stored long term via the MIDAS slow
control bus (MSCB), which will write periodic slow control data to a PostgreSQL database.
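
The idea of a setting that propagates to a frontend as soon as it is changed can be illustrated with a toy observer pattern. The sketch below is not the MIDAS API: `ToyODB`, `ToyFrontend`, and the key path are hypothetical names chosen only to show the mechanism.

```python
from collections import defaultdict


class ToyODB:
    """Toy key-value store with change callbacks (a stand-in, not MIDAS)."""

    def __init__(self):
        self._values = {}
        self._watchers = defaultdict(list)

    def watch(self, key, callback):
        """Register a callback invoked whenever `key` is set."""
        self._watchers[key].append(callback)

    def set(self, key, value):
        """Store a value and notify every watcher of that key."""
        self._values[key] = value
        for callback in self._watchers[key]:
            callback(key, value)

    def get(self, key):
        return self._values[key]


class ToyFrontend:
    """Frontend that mirrors one setting from the toy database."""

    def __init__(self, odb, hv_key):
        self.hv_setpoint = None
        odb.watch(hv_key, self._on_hv_change)

    def _on_hv_change(self, key, value):
        # A real frontend would push the new value to the hardware here.
        self.hv_setpoint = value


HV_KEY = "/Equipment/Calo01/Settings/HV"  # hypothetical key path
odb = ToyODB()
frontend = ToyFrontend(odb, HV_KEY)
odb.set(HV_KEY, 67.5)  # the frontend sees the change immediately
```

In this toy version the callback runs synchronously inside `set`; a real system would deliver the notification across processes.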

The frontends for the experiment are written in C/C++. A master frontend controls the
other frontends via remote procedure calls and synchronizes the data to the muon fill structure.
It also records a GPS timestamp that will enable subsequent synchronization with the
magnetic field data. In addition, a single frontend process will run to collect data for
each detector or detector group. There will be 24 calorimeter frontends, one tracker frontend
accumulating data from the three straw tracking detectors, and several frontends
acquiring data from auxiliary detectors such as the stored beam monitors, entrance counters,
and electric quadrupole monitors. All of these detectors will be read out via WFDs in μTCA
crates, with the exception of the trackers, which are read out via μTCA-based TDCs but are still
controlled via an AMC13 with IPBus. This allows a single frontend code
for the acquisition of data from an AMC13 via 10GbE to be used for all detectors
in the experiment, with only a few options for customization between detectors.

The key difference in processing between the high-rate data coming from the calorimeters
and the relatively low-rate data coming from the trackers and auxiliary detectors is that the
calorimeter data will be processed using a hybrid system of multicore CPUs and graphics
processing units (GPUs). Data from each of the 24 calorimeters will be processed into multiple
derived datasets in a single NVIDIA Tesla K40 GPU. The K40 GPU uses 2880 cores to
provide a vast increase in parallelization over what would be available in traditional CPU-based
data processing. The K40 also has 12 GB of onboard memory and a memory bandwidth of 288
GB/s. Data are transferred between the system memory and the GPU via PCIe version 3.0, which
allows for a significant increase in bandwidth over the older Tesla K20 GPU and PCIe version
2.0, as shown in Table 1.

The CPU processing for each fill is divided into three distinct threads, with mutual exclusion
(mutex) locks used to ensure data integrity. A TCP
thread is responsible for reading data from the TCP socket, unpacking the data, and copying
it to a ring buffer. The GPU thread then performs an asynchronous memory copy to send the
data to the GPU, launches the data processing, and then copies the results back to the system
memory. The MIDAS thread packs the processed data into MIDAS banks and passes it to the
data quality monitoring system.
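
The three-thread structure can be sketched in Python as a small pipeline. This is a simplified illustration, not the experiment's C/C++ frontend: the stage functions, the use of two ring buffers, the `STOP` sentinel, and the stand-in "processing" (a plain sum) are all assumptions made for the example.

```python
import threading


class RingBuffer:
    """Fixed-size ring buffer guarded by a mutex (inside threading.Condition)."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0    # index of the next slot to read
        self._count = 0   # number of filled slots
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            while self._count == len(self._slots):
                self._cond.wait()          # buffer full: wait for a reader
            tail = (self._head + self._count) % len(self._slots)
            self._slots[tail] = item
            self._count += 1
            self._cond.notify_all()

    def get(self):
        with self._cond:
            while self._count == 0:
                self._cond.wait()          # buffer empty: wait for a writer
            item = self._slots[self._head]
            self._head = (self._head + 1) % len(self._slots)
            self._count -= 1
            self._cond.notify_all()
            return item


STOP = object()  # sentinel marking the end of the run


def tcp_stage(fills, raw_buf):
    """Stand-in for reading and unpacking fills from the TCP socket."""
    for fill in fills:
        raw_buf.put(fill)
    raw_buf.put(STOP)


def gpu_stage(raw_buf, processed_buf):
    """Stand-in for the GPU copy and processing (here: sum the samples)."""
    while True:
        fill = raw_buf.get()
        if fill is STOP:
            processed_buf.put(STOP)
            return
        fill_number, samples = fill
        processed_buf.put((fill_number, sum(samples)))


def midas_stage(processed_buf, banks):
    """Stand-in for packing the processed data into banks."""
    while True:
        item = processed_buf.get()
        if item is STOP:
            return
        banks.append(item)


def run_pipeline(fills, capacity=4):
    raw_buf, processed_buf, banks = RingBuffer(capacity), RingBuffer(capacity), []
    threads = [
        threading.Thread(target=tcp_stage, args=(fills, raw_buf)),
        threading.Thread(target=gpu_stage, args=(raw_buf, processed_buf)),
        threading.Thread(target=midas_stage, args=(processed_buf, banks)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return banks
```

Because each buffer has exactly one writer and one reader, fills leave the pipeline in the order they entered it, and a full buffer makes the upstream stage wait rather than drop data.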

The GPU multithreading is programmed and optimized using the NVIDIA CUDA libraries.
The CUDA code reads the raw digitizer data and processes it into two derived datasets, referred
to as the T- and Q-methods. The T-method is a standard data taking method in which individual
positron hits in the calorimeter are identified. The data are processed into islands of digitizer data
where the signal surpasses a given threshold, and the energy of each hit can be measured. The
anomalous precession frequency $\omega_a$ is then extracted from a pileup-subtracted time spectrum.
This was the process used by the Brookhaven E821 experiment. The main drawback of this
method is that pileup causes an early-to-late phase shift, which is a significant contribution
to the systematic uncertainty. An alternative derived dataset is the so-called Q-method. In the
Q-method, individual events are not identified; instead the detector current is integrated
as a function of time, and $\omega_a$ can subsequently be extracted from this time distribution. This
method is attractive because it is hoped to be immune to pileup-related uncertainties. Both of
these methods can be implemented simultaneously in a single GPU.
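
The two derived datasets can be illustrated on a single digitized trace. The sketch below is serial Python rather than the CUDA code used in the experiment, and it makes several assumptions for illustration: pulses are positive-going above a known pedestal, islands are padded by a fixed number of samples, and the Q-method is represented by summing samples in fixed time bins. All function names are hypothetical.

```python
def find_islands(samples, pedestal, threshold, pre=2, post=2):
    """T-method style: (start, end) ranges where the signal exceeds threshold.

    `end` is exclusive; each island is padded by `pre`/`post` samples and
    overlapping or touching islands are merged.
    """
    islands = []
    n = len(samples)
    i = 0
    while i < n:
        if samples[i] - pedestal > threshold:
            start = max(0, i - pre)
            while i < n and samples[i] - pedestal > threshold:
                i += 1
            end = min(n, i + post)
            if islands and start <= islands[-1][1]:
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
        else:
            i += 1
    return islands


def island_energy(samples, pedestal, island):
    """Pedestal-subtracted sum over one island (arbitrary units)."""
    start, end = island
    return sum(s - pedestal for s in samples[start:end])


def q_method_histogram(samples, pedestal, bin_width):
    """Q-method style: integrate the pedestal-subtracted signal in time bins."""
    return [
        sum(s - pedestal for s in samples[i:i + bin_width])
        for i in range(0, len(samples), bin_width)
    ]


trace = [100, 100, 100, 100, 150, 300, 180, 100, 100, 100, 100, 100]
islands = find_islands(trace, pedestal=100, threshold=20, pre=1, post=1)
energies = [island_energy(trace, 100, isl) for isl in islands]
q_hist = q_method_histogram(trace, pedestal=100, bin_width=4)
```

The T-method output keeps only the samples around each hit, while the Q-method output keeps one number per time bin regardless of how many hits overlap in it.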

Fig. 3 describes the accumulation of data from the frontend processes by the event builder,
which will be hosted on a backend machine and is used to assemble event fragments from
the various frontend processes into a single deadtime-free record of each 700 μs fill. The data
will then be copied to the Fermilab central archive via a dedicated 20 Gb/s connection, and
also monitored in real time using ROME, a package that interfaces with MIDAS to display
ROOT-based analysis in real time for data quality monitoring. The total data output of the
experiment is expected to be ~2 PB.
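
The event-building step can be sketched as collecting one fragment per frontend for each fill and releasing the event once the set is complete. This is a toy illustration under assumed names (`ToyEventBuilder`, the frontend labels), not the MIDAS event builder itself.

```python
from collections import defaultdict


class ToyEventBuilder:
    """Assemble per-fill events from fragments sent by several frontends."""

    def __init__(self, frontend_names):
        self._expected = frozenset(frontend_names)
        self._pending = defaultdict(dict)  # fill number -> {frontend: fragment}

    def add_fragment(self, fill_number, frontend, fragment):
        """Store one fragment; return the full event once it is complete."""
        if frontend not in self._expected:
            raise ValueError(f"unknown frontend: {frontend}")
        self._pending[fill_number][frontend] = fragment
        if set(self._pending[fill_number]) == self._expected:
            return {"fill": fill_number,
                    "fragments": self._pending.pop(fill_number)}
        return None  # still waiting for other frontends


builder = ToyEventBuilder(["calo01", "calo02"])
first = builder.add_fragment(1, "calo01", b"a")   # incomplete, returns None
event = builder.add_fragment(1, "calo02", b"b")   # complete event for fill 1
```

Keying the pending fragments by fill number lets fragments from different fills arrive interleaved without being mixed.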



Equipment

Equipment          Status                     Events   Events[/s]   Data[MB/s]
MasterGM2          MasterGM2@wildcat          4293     12.0         0.001
AMC13Simulator01   AMC13Simulator01@rave01    4309     12.0         0.000
AMC1301            AMC1301@rave01             4308     12.0         1.714
AMC13Simulator02   AMC13Simulator02@rave01    4306     12.0         0.000
AMC1302            AMC1302@rave01             4303     12.0         1.720
EB                 Ebuilder@wildcat           4287     11.9         3.424

Clients

mhttpd [wildcat]            Logger [wildcat]             Ebuilder [wildcat]
MasterGM2 [wildcat]         AMC13Simulator01 [rave01]    AMC1301 [rave01]
AMC13Simulator02 [rave01]   AMC1302 [rave01]

Figure 2. The MIDAS web interface showing two calorimeter readout frontends
running at full expected data rate.


3. Prototyping 

With the start of data taking for the Fermilab muon g-2 experiment planned for the beginning
of 2017, the data acquisition system will be fully constructed and operational by mid-2016. A
phased build approach is being used. A system composed of a backend, frontend, gateway,
and μTCA shelf with AMC13 has been assembled and is currently being used for prototyping.
The system will be expanded to 25% of its full strength in mid-2015, and 50% by the end of
2015. Prototyping is currently underway using simulated data and readout from the AMC13
over 10 GbE, and the full system will be tested with the laser calibration system well before the
start of data taking.

Figure 3. The DAQ for g-2 will be composed of a layered array of processors, with a frontend
layer consisting of a hybrid system of GPUs with multi-core CPUs and a backend system used
to assemble event fragments and control the experiment.

Table 1. The GPU data transfer bandwidth was measured with different PCIe versions using
Tesla K20 and K40 GPUs. The maximum bandwidth of PCIe version 2.0 is quoted as 500 MB/s
per lane and for PCIe version 3.0 is 984.6 MB/s per lane. Tesla GPUs each utilize 16 lanes of
PCIe. The memory is copied either as pageable or as pinned, which is memory that cannot be
swapped.

PCIe Version   GPU   Host to device, pageable   Host to device, pinned
2              K20   3326.6 MB/s                5028.3 MB/s
3              K20   5628.6 MB/s                6003.6 MB/s
3              K40   6647.8 MB/s                10044.3 MB/s
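
The figures in Table 1 can be put in context by comparing them with the quoted per-lane maxima and by estimating how long one fill takes to copy to the GPU. The sketch below is illustrative only: the function names are hypothetical, and it assumes 1 MB = 10^6 bytes and a per-calorimeter payload of 54 channels × 2 bytes × 560000 samples per fill.

```python
PCIE_MAX_PER_LANE_MB_S = {2: 500.0, 3: 984.6}  # values quoted with Table 1
LANES = 16

# (PCIe version, GPU, pageable MB/s, pinned MB/s), copied from Table 1
MEASURED = [
    (2, "K20", 3326.6, 5028.3),
    (3, "K20", 5628.6, 6003.6),
    (3, "K40", 6647.8, 10044.3),
]


def theoretical_max_mb_s(pcie_version):
    """Maximum bandwidth of a 16-lane link for the given PCIe version."""
    return PCIE_MAX_PER_LANE_MB_S[pcie_version] * LANES


def efficiency(measured_mb_s, pcie_version):
    """Measured bandwidth as a fraction of the 16-lane maximum."""
    return measured_mb_s / theoretical_max_mb_s(pcie_version)


def fill_transfer_time_ms(bandwidth_mb_s, channels=54, bytes_per_sample=2,
                          samples_per_fill=560_000):
    """Time to copy one calorimeter's raw fill to the GPU, in milliseconds."""
    payload_mb = channels * bytes_per_sample * samples_per_fill / 1e6
    return 1000.0 * payload_mb / bandwidth_mb_s


if __name__ == "__main__":
    for version, gpu, pageable, pinned in MEASURED:
        print(f"PCIe {version} {gpu}: "
              f"pinned efficiency {efficiency(pinned, version):.1%}, "
              f"fill copy {fill_transfer_time_ms(pinned):.1f} ms")
```

Under these assumptions the K40 pinned-memory bandwidth corresponds to roughly 6 ms per fill per calorimeter, which is below the 11 ms minimum fill separation quoted for the experiment.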

The system is currently being prototyped in several test stands. Two test stands for the
calorimeter aspect of the DAQ are operated by the University of Kentucky, a test stand for
the tracker readout is set up at University College London, and the data quality monitoring
system is being developed at the Joint Institute for Nuclear Research in Dubna. The following
will focus primarily on prototyping of the calorimeter DAQ.

The two calorimeter test stands use essentially the same equipment, but one is
set up at the University of Kentucky and the other is set up at Fermilab and operated by the
University of Kentucky. Both use a single frontend machine and a backend machine connected
by 10GbE. Both test stands also currently have a single μTCA crate with an AMC13 control
module and MCH.

The Fermilab test stand is set up with two machines that are intended to be included in the
final data acquisition system. Both computers use Intel Xeon processors. The backend
uses an 8-core processor with a 20 MB cache, and the frontend uses a 6-core processor with
a 12 MB cache. The backend contains 64 GB of 2133 MHz DDR4 ECC RAM, and the frontend
has 32 GB of 1600 MHz DDR3. The frontend machine also houses the NVIDIA Tesla K40 GPU.
We are experimenting with a setup using two K40 GPUs in one machine; to test this we are
also running the frontend with an additional K20 GPU, which would be replaced with a K40
if this setup is chosen for running during the experiment.

Since the WFDs are still under development, much of the DAQ prototyping has used a
simulator that generates fake data and packages it as the real AMC13 would. A data generator
makes simulated waveforms for each of the 54 channels of a calorimeter, as shown in Fig. 4. The
simulator, which runs as an additional MIDAS frontend called AMC13Simulator, then packages
the raw waveform data in exactly the same manner as the AMC13 will.
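
A waveform generator of this kind can be sketched in a few lines. The version below is a hypothetical illustration and not the collaboration's simulator: the Gaussian pulse shape, pedestal, noise level, amplitude range, and the time-dilated muon lifetime of about 64.4 μs used for the pulse times are all assumptions chosen for the example.

```python
import math
import random


def simulate_waveform(n_samples=560_000, sample_ns=1.25, lifetime_ns=64_400.0,
                      n_pulses=200, pedestal=200, noise_rms=2.0,
                      pulse_sigma_samples=4.0, max_amplitude=1500, seed=0):
    """Return one channel of simulated 12 bit digitizer samples for a fill.

    Pulse times are drawn from an exponential decay distribution and each
    pulse is a Gaussian added on top of a noisy pedestal (assumed shapes).
    """
    rng = random.Random(seed)
    waveform = [pedestal + rng.gauss(0.0, noise_rms) for _ in range(n_samples)]
    fill_ns = n_samples * sample_ns
    for _ in range(n_pulses):
        t_ns = rng.expovariate(1.0 / lifetime_ns)
        if t_ns >= fill_ns:
            continue  # decay falls outside the digitized window
        center = t_ns / sample_ns
        amplitude = rng.uniform(0.1, 1.0) * max_amplitude
        lo = max(0, int(center - 5 * pulse_sigma_samples))
        hi = min(n_samples, int(center + 5 * pulse_sigma_samples) + 1)
        for i in range(lo, hi):
            z = (i - center) / pulse_sigma_samples
            waveform[i] += amplitude * math.exp(-0.5 * z * z)
    # Clip to the 12 bit range of the digitizer and convert to integers.
    return [min(4095, max(0, int(round(v)))) for v in waveform]


if __name__ == "__main__":
    trace = simulate_waveform(n_samples=8000)
    print(len(trace), min(trace), max(trace))
```

The default of 560000 samples corresponds to a 700 μs fill sampled at 800 MHz (1.25 ns per sample); a shorter trace is convenient for quick tests.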


Figure 4. Simulated waveform for a
700 μs muon fill as recorded by an
800 MHz, 12 bit waveform digitizer.
The pulses represent simulated
decays of muons to positrons as
seen in the calorimeter.


The simulated waveforms can then be read and unpacked by the same MIDAS frontend that 
will read and unpack data from a physical AMC13 with a full complement of 12 WFDs. Because
the simulated waveform has a physical structure analogous to that which we expect from the 
WFDs, the CUDA algorithms for the T and Q methods can be fully implemented and tested 
using this method. 

Fig. 5 shows a sample of the processing times recorded for one run with this simulated data.
The time stamp of each subsequent process is histogrammed vertically. The tcp_thread reads the
data and transfers it to the GPU_thread, which processes the data and sends it on to the MFE_thread.
The greatest amount of time is spent reading the data from the TCP socket, followed by copying
the data to the GPU memory. Considerable effort has gone, and is still going, into streamlining
this process as much as possible. Upgrading from PCIe version 2.0 to version 3.0 gave a large
improvement, as was shown earlier in Table 1. The size of the data is greatly reduced by the
GPU processing, and thus the processes occurring after that happen much faster.

Figure 5. Processing time of data in the frontend computer. The vertical axis represents the
executions of each subsequent thread, and the horizontal axis provides the time of execution.

The first readout by the DAQ of the new waveform digitizers being developed at
Cornell occurred during a test-beam run at SLAC in late 2014. Tests were
performed using a single-channel WFD, and later a 5-channel WFD that did not yet have the full
implementation of its functionality, but data from the five Rider FPGAs were passed through
the μTCA crate backplane to the AMC13 and read out by the data acquisition system over 10
GbE. We plan to read five of the 5-channel Rider boards by mid-2015, and to have one fully
functioning μTCA shelf (12 WFDs, and so 60 channels) by the fall of 2015.

A test of the event builder was performed by sending block data from 24 frontends and
assembling the event fragments at the full rate of 12 Hz. The FakeData frontend has the ability
to manually scale the size of the data output, which enabled us to analyze the performance of
the event builder as a function of data size, as shown in Figs. 6 and 7. The rate of
data that the experiment will write to disk after the GPU processing is expected to be 80-100
MB/s, which the MIDAS event builder exceeded in this test. The test was performed on older, less
powerful computers than those being set up for the experiment, so we are confident
that the event builder will be able to exceed even the displayed performance.


Figure 6. Test of event building with 24 frontends (plot title: DAQ Data Rate vs. Number of
Frontends). The horizontal axis is the number of frontend processes from 1 to 24. The blue
squares represent the rate in events/s, the orange diamonds represent the rate in MB/s for each
frontend process, and the blue triangles represent the total data rate.

Figure 7. Test of event building with 24 frontends running with a varying volume of data
(plot title: Total Data Rate vs. Data Volume). The saturation occurring above 112 MB/s is
thought to be a limitation from the speed of the memcpys that are being performed in the
frontend code. The event builder comfortably outperforms our projected derived data rate of
80 MB/s.


4. Conclusion 

A data acquisition system is being built for the new Muon g-2 experiment at Fermilab. The
experiment plans to begin data taking in early 2017 and to collect 21
times the statistics of the BNL experiment, which requires a new state-of-the-art acquisition
system utilizing parallel data processing in a hybrid system of multi-core CPUs and GPUs. The
DAQ will acquire data from 1296 channels of custom μTCA waveform digitizers, as well as straw
trackers and auxiliary detectors, at a rate of 18 GB/s. Prototyping of the DAQ is underway, and
construction will be complete by mid-2016.